In [1]:
from IPython.display import HTML
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import geopandas as gpd
from shapely.geometry import Point
import sys
sys.path.insert(0, '../src/processing/')
import tools
import os.path
import definitions
With the large volume of CTA bus location data I have collected so far in 2019, I want to develop a process for analyzing the data on a neighborhood and city level. My aim is to measure and compare the quality of bus service within each Chicago community area. In this notebook, I begin to develop a process for neighborhood-level analysis by working with data for the bus routes that service Logan Square, specifically restricting my analysis to the geopositional data and bus stops within the boundaries of the neighborhood. To start, I estimate the average wait times for each bus route servicing Logan Square, as well as the proportion of the trips on each route that experience bus bunching in the neighborhood.
Logan Square is served by the following bus routes:

- 49 Western
- X49 Western Express
- 52 Kedzie/California
- 53 Pulaski
- 56 Milwaukee
- 73 Armitage
- 74 Fullerton
- 76 Diversey
- 82 Kimball-Homan
I use data from 73 Armitage to develop the initial data processing workflow and then combine the code into reusable functions.
The geospatial boundaries of Chicago's 77 community areas can be found at the Chicago Data Portal. I read them directly into a GeoDataFrame.
In [2]:
commareas = gpd.read_file("../data/raw/geofences/Boundaries - Community Areas (current).geojson")
commareas.plot()
commareas.head()
Out[2]:
I'm only concerned about the boundaries of Logan Square for the rest of the notebook.
In [3]:
losq = commareas[commareas.community == "LOGAN SQUARE"]
losq.plot()
Out[3]:
I load pattern data for Route 73 Armitage, which details the locations of each of its bus stops. The plan is to identify only those bus stops that are in Logan Square by performing a spatial join between this data set and losq. Since the raw pattern data is not stored in a geospatial file format (the files are regular JSON), I cannot load it directly into a GeoDataFrame. I first load it into a regular DataFrame, then create geometry objects from the latitude and longitude columns, and finally combine the DataFrame and geometry into a GeoDataFrame.
In [4]:
patterns = tools.load_patterns(73, False)
patterns.head()
Out[4]:
In [5]:
geometry = [Point(xy) for xy in zip(patterns.lon, patterns.lat)]
patterns_gdf = gpd.GeoDataFrame(patterns, geometry=geometry, crs="EPSG:4326")
patterns_gdf.plot()
patterns_gdf.head()
Out[5]:
In [6]:
losq_stops = gpd.sjoin(patterns_gdf, losq, how="inner", predicate="intersects")
losq_stops.plot()
Out[6]:
In [7]:
travels_waits = tools.load_travels_waits(73, "Westbound", "201901")
travels_waits.head()
Out[7]:
Because bus service varies throughout the day, it makes sense to analyze bus service during different time intervals. The CTA defines four weekday service intervals in its Service Standards and Policies document:

- AM Peak: 6:00 a.m. to 9:00 a.m.
- Midday: 9:00 a.m. to 3:00 p.m.
- PM Peak: 3:00 p.m. to 7:00 p.m.
- Evening: 7:00 p.m. to 10:00 p.m.
I also go ahead and drop all of the rows where the origin bus stop is not within Logan Square.
In [8]:
cta_time_periods = [6, 9, 15, 19, 22]
cta_time_period_labels = ["AM Peak", "Midday", "PM Peak", "Evening"]
travels_waits["bins"] = pd.cut(travels_waits.decimal_time, cta_time_periods, labels=cta_time_period_labels, right=False)
travels_waits.drop(travels_waits[~travels_waits.origin.isin(losq_stops.stpid)].index, inplace=True)
travels_waits.head()
Out[8]:
As a bus completes its route, the wait time between it and the next bus will vary from stop to stop. First, I find the average wait time at the Logan Square bus stops for each Armitage bus trip. Then I average over those averages to find the average wait time for the Armitage bus in Logan Square. As explained in the documentation, each trip is uniquely identified by its "start_date", "pid", and "tatripid". I further aggregate the data by time of day.
Note: For each departure from an origin stop, the travels_waits DataFrame gives the wait time (in minutes) until the next bus arrival heading to a particular terminal stop. Since some routes have more than one terminus, there may be more than one calculated wait time. Each column "wait|STPID_1", ..., "wait|STPID_N" gives the wait time for the next bus heading to the terminal with stpid STPID_M.
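To make that layout concrete, here is a small hand-built sketch of the wide format; the stop IDs, trip IDs, and wait values below are invented for illustration and do not come from the actual data.

# Hypothetical illustration only: stop IDs, trip IDs, and wait values are invented.
example = pd.DataFrame({
    "start_date": pd.to_datetime(["2019-01-07", "2019-01-07"]),
    "pid": [4370, 4370],
    "tatripid": [1001, 1001],
    "origin": ["14151", "14153"],   # stop the bus departed from
    "decimal_time": [7.25, 7.40],   # departure time as hours since midnight
    "wait|14179": [9.5, 11.0],      # minutes until the next bus headed to terminal 14179
    "wait|14200": [None, 22.0],     # minutes until the next bus headed to terminal 14200
})
example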
In [9]:
terminals = travels_waits.columns[travels_waits.columns.str.contains(r"wait\|")]
trip_means = travels_waits.groupby(["start_date", "pid", "tatripid", "bins"])[terminals].mean().reset_index()
The travels_waits data files are already separated by direction of travel. Explicitly naming and grouping by the direction of travel is just for convenience when reading the printed output.
In [10]:
trip_means["rtdir"] = "Westbound"
trip_means.groupby(["rtdir", "bins"])[terminals].mean()
As expected, the morning and evening rush services have the shortest wait times, while late-evening service has the longest. The Armitage bus is classified as a "Support Route" by the CTA, so it has longer wait times and a narrower window of service than "Key Routes" like Route 74 Fullerton.
To find the proportion of trips that were "bunched", I first count the number of bunching incidents during each time interval, then count the total number of trips during each time interval, and divide the two results.
I arbitrarily classify a trip as bunched if the wait time between it and the next bus (heading toward or through the same terminal stop) is less than 2 minutes. In the future, I could define bus bunching in a more nuanced way, such as setting the bunching threshold in terms of the route's scheduled wait times, but this definition works for a first-round exploration and analysis.
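As a sketch of that more nuanced direction, the bunching flag could be computed relative to a scheduled headway instead of a fixed 2 minutes. The scheduled_headway column referenced below is hypothetical; nothing like it exists in my current data.

# Sketch only: assumes a hypothetical scheduled_headway column (minutes) aligned with trip_means.
def flag_bunched(waits, scheduled_headway, frac=0.25):
    # Mark a wait as bunched when it falls below frac of the scheduled headway
    return waits.lt(scheduled_headway * frac, axis=0)

# Usage would look something like:
# bunched_flags = flag_bunched(trip_means[terminals], trip_means["scheduled_headway"])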
In [11]:
# bunching incidents
masked = trip_means.copy()
masked[terminals] = (masked[terminals] < 2)
masked.groupby(["rtdir", "bins"])[terminals].sum()
Out[11]:
In [12]:
# total number of trips
trip_means.groupby(["rtdir", "bins"])[terminals].count()
Out[12]:
In [13]:
# proportion of trips that were bunched
masked.groupby(["rtdir", "bins"])[terminals].sum() / trip_means.groupby(["rtdir", "bins"])[terminals].count()
Out[13]:
Route 73 experiences very little bus bunching under my definition. This shouldn't be surprising: the average headways on Route 73 exceed 13 minutes in every time period. With the workflow established, I now combine these steps into reusable functions and run them for every bus route and direction that serves Logan Square.
In [14]:
# Travels/waits data files are separated by route and month
dates = [
    "201901",
    "201902",
    "201903",
    "201904",
]
# Bus routes going through Logan Square
rts = ["49", "X49", "52", "53", "56", "73", "74", "76", "82"]
cta_holidays = pd.DatetimeIndex(definitions.HOLIDAYS)
cta_time_periods = [6, 9, 15, 19, 22]
cta_time_period_labels = ["AM Peak", "Midday", "PM Peak", "Evening"]

def load_community_area(path, ca):
    """Return the boundary GeoDataFrame for a single community area."""
    commareas = gpd.read_file(path)
    return commareas[commareas.community == ca.upper()]

def patterns_to_gdf(rt, waypoints):
    """Load a route's pattern data and convert it to a GeoDataFrame of stops."""
    patterns = tools.load_patterns(rt, waypoints)
    geometry = [Point(xy) for xy in zip(patterns.lon, patterns.lat)]
    return gpd.GeoDataFrame(patterns, geometry=geometry, crs="EPSG:4326")
In [15]:
losq_mean_waits = []
losq_bunching = []
losq = load_community_area("../data/raw/geofences/Boundaries - Community Areas (current).geojson", "LOGAN SQUARE")
for rt in rts:
    patterns_gdf = patterns_to_gdf(rt, False)
    losq_stops = gpd.sjoin(patterns_gdf, losq, how="inner", predicate="intersects")
    colname_map = {"wait|{}".format(stpid): stpnm for stpid, stpnm in zip(patterns_gdf.stpid, patterns_gdf.stpnm)}
    for rtdir in losq_stops.rtdir.unique():
        tws = pd.concat((tools.load_travels_waits(rt, rtdir.lower(), d) for d in dates), ignore_index=True)
        # I further restrict the analysis to non-holiday weekdays,
        # since frequency of service is different on holidays and weekends
        tws.drop(tws[(tws.start_date.dt.dayofweek >= 5) | tws.start_date.isin(cta_holidays)].index, inplace=True)
        tws.drop(tws[~tws.origin.isin(losq_stops.stpid)].index, inplace=True)
        tws.dropna(axis='columns', how='all', inplace=True)
        terminals = [colname_map[col] for col in tws.columns[tws.columns.str.contains(r"wait\|")]]
        tws.rename(columns=colname_map, inplace=True)
        tws["bins"] = pd.cut(tws.decimal_time, cta_time_periods, labels=cta_time_period_labels, right=False)
        rt_waits = tws.groupby(["start_date", "pid", "tatripid", "bins"])[terminals].mean().reset_index()
        rt_waits["rtdir"] = rtdir
        bunched = rt_waits.copy()
        bunched[terminals] = (bunched[terminals] < 2)
        bunching_incidents = bunched.groupby(["rtdir", "bins"])[terminals].sum()
        tot_trips = rt_waits.groupby(["rtdir", "bins"])[terminals].count()
        rt_mean_waits = rt_waits.groupby(["rtdir", "bins"])[terminals].mean()
        rt_bunching = bunching_incidents / tot_trips
        losq_mean_waits.append((rt, rtdir, rt_mean_waits))
        losq_bunching.append((rt, rtdir, rt_bunching))
        print(rt, rtdir)
        print(rt_mean_waits)
        print("bus bunching statistics")
        print(rt_bunching)
Scanning the output, there are a couple of routes with very high wait times. Route 53 Pulaski buses heading toward Pulaski & Harrison or Irving Park & Keystone only make trips at night, hence the long wait times for those buses during the day. Similarly, Route 82 Kimball-Homan buses only make trips to Devon & Kedzie at the start and end of their service day.
In [16]:
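# Reshape each route's wide per-terminal summary into one long table, then drop the outlier terminals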
outliers = ["Pulaski & Harrison (Blue Line)", "Irving Park + Keystone", "Devon + Kedzie Terminal"]
all_waits = pd.concat(
(pd.melt(x[2].reset_index(), id_vars=["rtdir", "bins"], var_name="terminal", value_name="wait_time").assign(rt=x[0]) for x in losq_mean_waits),
ignore_index=True
)
all_waits = all_waits[~all_waits.terminal.isin(outliers)]
In [17]:
all_waits.loc[all_waits.wait_time.nlargest(3).index]
Out[17]:
In [18]:
print "Longest wait times:"
print all_waits.loc[all_waits.wait_time.nlargest(6).index]
In [19]:
print "Shortest wait times:"
print all_waits.loc[all_waits.wait_time.nsmallest(3).index]
In [20]:
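# Same reshaping for the bunching ratios, again excluding the outlier terminals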
all_bunches = pd.concat(
(pd.melt(x[2].reset_index(), id_vars=["rtdir", "bins"], var_name="terminal", value_name="bunch_ratio").assign(rt=x[0]) for x in losq_bunching),
ignore_index=True
)
all_bunches = all_bunches[~all_bunches.terminal.isin(outliers)]
print "Overall worst bus bunching:"
print all_bunches.loc[all_bunches.bunch_ratio.nlargest(3).index]
In [21]:
print "Most bunching during morning rush hour:"
print all_bunches.loc[all_bunches[all_bunches.bins == "AM Peak"].bunch_ratio.nlargest(3).index]
print "\nLeast bunching during morning rush hour:"
print all_bunches.loc[all_bunches[all_bunches.bins == "AM Peak"].bunch_ratio.nsmallest(3).index]
In [22]:
print "Most bunching during evening rush hour:"
print all_bunches.loc[all_bunches[all_bunches.bins == "PM Peak"].bunch_ratio.nlargest(3).index]
print "\nLeast bunching during evening rush hour:"
print all_bunches.loc[all_bunches[all_bunches.bins == "PM Peak"].bunch_ratio.nsmallest(3).index]